-
Notifications
You must be signed in to change notification settings - Fork 35
feat: add event handling strategy changes in csp-health-monitor #641
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. Weβll occasionally send you account related emails.
Already on GitHub? Sign in to your account
base: main
Are you sure you want to change the base?
feat: add event handling strategy changes in csp-health-monitor #641
Conversation
|
Warning Rate limit exceeded@tanishagoyal2 has exceeded the limit for the number of commits that can be reviewed per hour. Please wait 4 minutes and 35 seconds before requesting another review. β How to resolve this issue?After the wait time has elapsed, a review can be triggered using the We recommend that you space out your commits to avoid hitting the rate limit. π¦ How do rate limits work?CodeRabbit enforces hourly rate limits for each developer per organization. Our paid plans have higher rate limits than the trial, open-source and free plans. In all cases, we re-allow further reviews after a brief timeout. Please see our FAQ for further information. π Files selected for processing (4)
π WalkthroughWalkthroughAdds a configurable processing strategy (EXECUTE_REMEDIATION or STORE_ONLY): Helm value and deployment flag, CLI parsing/validation, threaded the strategy into the trigger engine so emitted pb.HealthEvent includes it, and tests/helpers to verify STORE_ONLY behavior. Changes
Sequence DiagramsequenceDiagram
participant Helm as Helm values
participant Deployment as Deployment YAML
participant CLI as maintenance-notifier (main)
participant Engine as Trigger Engine
participant Store as Datastore
participant Exporter as Event Exporter
Helm->>Deployment: supply processingStrategy value
Deployment->>CLI: container started with --processing-strategy
CLI->>CLI: validate value against protobuf enum
CLI->>Engine: NewEngine(..., processingStrategy)
Engine->>Engine: store processingStrategy on Engine
Engine->>Store: persist HealthEvent (includes processingStrategy)
Engine->>Exporter: emit HealthEvent/CloudEvent (includes processingStrategy)
Exporter->>Exporter: export semantics reflect processingStrategy
Estimated code review effortπ― 3 (Moderate) | β±οΈ ~22 minutes Poem
π₯ Pre-merge checks | β 2 | β 1β Failed checks (1 warning)
β Passed checks (2 passed)
βοΈ Tip: You can configure your own custom pre-merge checks in the settings. Thanks for using CodeRabbit! It's free for OSS, and your support helps us grow. If you like it, consider giving us a shout-out. Comment |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Actionable comments posted: 3
π§Ή Nitpick comments (2)
tests/helpers/kube.go (1)
2338-2393: Minor: Docstring mentions environment variables instead of arguments.The function docstring says "removes environment variables" but this function removes container arguments.
Suggested fix for docstring
-// RemoveDeploymentArgs removes environment variables from containers in a deployment. +// RemoveDeploymentArgs removes arguments from containers in a deployment. // If containerName is empty, removes from all containers. Otherwise, removes only from the named container. // Uses retry.RetryOnConflict for automatic retry handling.tests/csp_health_monitor_test.go (1)
522-535: Consider using the return value of TeardownCSPHealthMonitorTest.The teardown correctly restores the processing strategy, but
TeardownCSPHealthMonitorTestreturns a context that is discarded. Other tests in this file return the result of this call. While this may be intentional, it should be consistent with other teardown patterns.- helpers.TeardownCSPHealthMonitorTest(ctx, t, c, testCtx) - - return ctx + return helpers.TeardownCSPHealthMonitorTest(ctx, t, c, testCtx)
π Review details
Configuration used: Path: .coderabbit.yaml
Review profile: CHILL
Plan: Pro
β Files ignored due to path filters (1)
data-models/pkg/protos/health_event.pb.gois excluded by!**/*.pb.go
π Files selected for processing (23)
data-models/protobufs/health_event.protodistros/kubernetes/nvsentinel/charts/csp-health-monitor/templates/deployment.yamldistros/kubernetes/nvsentinel/charts/csp-health-monitor/values.yamlevent-exporter/pkg/transformer/cloudevents.goevent-exporter/pkg/transformer/cloudevents_test.gofault-quarantine/pkg/evaluator/rule_evaluator_test.gofault-quarantine/pkg/initializer/init.gohealth-monitors/csp-health-monitor/cmd/maintenance-notifier/main.gohealth-monitors/csp-health-monitor/pkg/triggerengine/trigger.gohealth-monitors/csp-health-monitor/pkg/triggerengine/trigger_test.gohealth-monitors/gpu-health-monitor/gpu_health_monitor/protos/health_event_pb2.pyhealth-monitors/gpu-health-monitor/gpu_health_monitor/protos/health_event_pb2.pyiplatform-connectors/pkg/connectors/kubernetes/k8s_platform_connector_test.goplatform-connectors/pkg/connectors/kubernetes/process_node_events.gostore-client/pkg/client/mongodb_pipeline_builder.gostore-client/pkg/client/pipeline_builder.gostore-client/pkg/client/pipeline_builder_test.gostore-client/pkg/client/postgresql_pipeline_builder.gotests/csp_health_monitor_test.gotests/event_exporter_test.gotests/helpers/event_exporter.gotests/helpers/healthevent.gotests/helpers/kube.go
π§° Additional context used
π Path-based instructions (5)
**/*.go
π CodeRabbit inference engine (.github/copilot-instructions.md)
**/*.go: Follow standard Go conventions with gofmt and golint
Use structured logging vialog/slogin Go code
Wrap errors with context usingfmt.Errorf("context: %w", err)in Go code
Withinretry.RetryOnConflictblocks, return errors without wrapping to preserve retry behavior
Use meaningful variable names such assyncedoverokfor cache sync checks
Useclient-gofor Kubernetes API interactions in Go code
Prefer informers over direct API calls for watching Kubernetes resources
Implement proper shutdown handling with context cancellation in Go code
Package-level godoc required for all Go packages
Function comments required for all exported Go functions
Use inline comments for complex logic only in Go code
TODO comments should reference issues in Go code
Extract informer event handler setup into helper methods
Use separate informers for different Kubernetes resource types
Files:
tests/helpers/healthevent.gostore-client/pkg/client/pipeline_builder_test.goplatform-connectors/pkg/connectors/kubernetes/process_node_events.gostore-client/pkg/client/mongodb_pipeline_builder.gotests/event_exporter_test.gostore-client/pkg/client/pipeline_builder.gofault-quarantine/pkg/evaluator/rule_evaluator_test.goevent-exporter/pkg/transformer/cloudevents_test.goplatform-connectors/pkg/connectors/kubernetes/k8s_platform_connector_test.gohealth-monitors/csp-health-monitor/pkg/triggerengine/trigger_test.gostore-client/pkg/client/postgresql_pipeline_builder.gotests/helpers/kube.gohealth-monitors/csp-health-monitor/pkg/triggerengine/trigger.goevent-exporter/pkg/transformer/cloudevents.gohealth-monitors/csp-health-monitor/cmd/maintenance-notifier/main.gotests/csp_health_monitor_test.gofault-quarantine/pkg/initializer/init.gotests/helpers/event_exporter.go
**/*_test.go
π CodeRabbit inference engine (.github/copilot-instructions.md)
**/*_test.go: Useenvtestfor testing Kubernetes controllers instead of fake clients
Usetestify/assertandtestify/requirefor assertions in Go tests
Write table-driven tests when testing multiple scenarios in Go
Name Go tests descriptively using format:TestFunctionName_Scenario_ExpectedBehavior
Files:
store-client/pkg/client/pipeline_builder_test.gotests/event_exporter_test.gofault-quarantine/pkg/evaluator/rule_evaluator_test.goevent-exporter/pkg/transformer/cloudevents_test.goplatform-connectors/pkg/connectors/kubernetes/k8s_platform_connector_test.gohealth-monitors/csp-health-monitor/pkg/triggerengine/trigger_test.gotests/csp_health_monitor_test.go
data-models/protobufs/**/*.proto
π CodeRabbit inference engine (.github/copilot-instructions.md)
data-models/protobufs/**/*.proto: Define Protocol Buffer messages indata-models/protobufs/directory
Use semantic versioning for breaking changes in Protocol Buffer messages
Include comprehensive comments for all fields in Protocol Buffer messages
Files:
data-models/protobufs/health_event.proto
**/values.yaml
π CodeRabbit inference engine (.github/copilot-instructions.md)
**/values.yaml: Document all values in Helm chartvalues.yamlwith inline comments
Include examples for non-obvious configurations in Helm chart documentation
Note truthy value requirements in Helm chart documentation where applicable
Files:
distros/kubernetes/nvsentinel/charts/csp-health-monitor/values.yaml
**/*.py
π CodeRabbit inference engine (.github/copilot-instructions.md)
**/*.py: Use Poetry for dependency management in Python code
Follow PEP 8 style guide for Python code
Use Black for formatting Python code
Type hints required for all functions in Python code
Use dataclasses for structured data in Python code
Files:
health-monitors/gpu-health-monitor/gpu_health_monitor/protos/health_event_pb2.py
π§ Learnings (7)
π Learning: 2025-12-22T16:16:24.320Z
Learnt from: pteranodan
Repo: NVIDIA/NVSentinel PR: 607
File: client-go/nvgrpc/config_test.go:61-80
Timestamp: 2025-12-22T16:16:24.320Z
Learning: In Go tests across the repository, avoid introducing the testify dependency for simple equality/inequality checks. Use the standard testing package assertions (t.Error, t.Errorf, t.Fatal, etc.) for straightforward checks. Reserve third-party assertion libraries for complex scenarios that require richer diagnostics or expressive matchers.
Applied to files:
store-client/pkg/client/pipeline_builder_test.gotests/event_exporter_test.gofault-quarantine/pkg/evaluator/rule_evaluator_test.goevent-exporter/pkg/transformer/cloudevents_test.goplatform-connectors/pkg/connectors/kubernetes/k8s_platform_connector_test.gohealth-monitors/csp-health-monitor/pkg/triggerengine/trigger_test.gotests/csp_health_monitor_test.go
π Learning: 2025-11-24T22:20:48.152Z
Learnt from: CR
Repo: NVIDIA/NVSentinel PR: 0
File: .github/copilot-instructions.md:0-0
Timestamp: 2025-11-24T22:20:48.152Z
Learning: Applies to **/*_test.go : Use `envtest` for testing Kubernetes controllers instead of fake clients
Applied to files:
tests/event_exporter_test.go
π Learning: 2025-11-24T22:20:48.152Z
Learnt from: CR
Repo: NVIDIA/NVSentinel PR: 0
File: .github/copilot-instructions.md:0-0
Timestamp: 2025-11-24T22:20:48.152Z
Learning: Applies to **/*_test.go : Write table-driven tests when testing multiple scenarios in Go
Applied to files:
tests/event_exporter_test.gohealth-monitors/csp-health-monitor/pkg/triggerengine/trigger_test.go
π Learning: 2025-11-24T22:20:48.152Z
Learnt from: CR
Repo: NVIDIA/NVSentinel PR: 0
File: .github/copilot-instructions.md:0-0
Timestamp: 2025-11-24T22:20:48.152Z
Learning: Applies to **/*.go : Extract informer event handler setup into helper methods
Applied to files:
tests/event_exporter_test.go
π Learning: 2025-11-12T14:08:15.229Z
Learnt from: tanishagoyal2
Repo: NVIDIA/NVSentinel PR: 333
File: health-monitors/csp-health-monitor/pkg/csp/aws/aws.go:624-632
Timestamp: 2025-11-12T14:08:15.229Z
Learning: In the AWS health monitor codebase (health-monitors/csp-health-monitor), the EventID field in model.MaintenanceEvent stores the AWS entity ARN. This is set during normalization in aws_normalizer.go where EventID is assigned from EventMetadata.EntityArn. Therefore, when processing active events, using activeEvent.EventID as the EntityArn value is correct and intentional.
Applied to files:
health-monitors/csp-health-monitor/pkg/triggerengine/trigger.go
π Learning: 2025-11-24T22:20:48.152Z
Learnt from: CR
Repo: NVIDIA/NVSentinel PR: 0
File: .github/copilot-instructions.md:0-0
Timestamp: 2025-11-24T22:20:48.152Z
Learning: Applies to **/values.yaml : Document all values in Helm chart `values.yaml` with inline comments
Applied to files:
distros/kubernetes/nvsentinel/charts/csp-health-monitor/values.yaml
π Learning: 2025-11-24T22:20:48.152Z
Learnt from: CR
Repo: NVIDIA/NVSentinel PR: 0
File: .github/copilot-instructions.md:0-0
Timestamp: 2025-11-24T22:20:48.152Z
Learning: Applies to **/values.yaml : Include examples for non-obvious configurations in Helm chart documentation
Applied to files:
distros/kubernetes/nvsentinel/charts/csp-health-monitor/values.yaml
𧬠Code graph analysis (10)
store-client/pkg/client/pipeline_builder_test.go (3)
store-client/pkg/client/pipeline_builder.go (1)
PipelineBuilder(26-47)store-client/pkg/client/mongodb_pipeline_builder.go (1)
NewMongoDBPipelineBuilder(29-31)store-client/pkg/client/postgresql_pipeline_builder.go (1)
NewPostgreSQLPipelineBuilder(29-31)
store-client/pkg/client/mongodb_pipeline_builder.go (2)
store-client/pkg/datastore/types.go (4)
ToPipeline(161-163)D(131-133)E(126-128)A(136-138)data-models/pkg/protos/health_event.pb.go (1)
ProcessingStrategy_EXECUTE_REMEDIATION(46-46)
tests/event_exporter_test.go (1)
tests/helpers/event_exporter.go (1)
ValidateCloudEvent(257-283)
event-exporter/pkg/transformer/cloudevents_test.go (2)
health-monitors/gpu-health-monitor/gpu_health_monitor/protos/health_event_pb2.pyi (1)
ProcessingStrategy(14-17)data-models/pkg/protos/health_event.pb.go (5)
ProcessingStrategy(43-43)ProcessingStrategy(72-74)ProcessingStrategy(76-78)ProcessingStrategy(85-87)ProcessingStrategy_STORE_ONLY(47-47)
platform-connectors/pkg/connectors/kubernetes/k8s_platform_connector_test.go (2)
data-models/pkg/protos/health_event.pb.go (19)
HealthEvent(260-280)HealthEvent(293-293)HealthEvent(308-310)Entity(208-214)Entity(227-227)Entity(242-244)RecommendedAction(89-89)RecommendedAction(139-141)RecommendedAction(143-145)RecommendedAction(152-154)ProcessingStrategy(43-43)ProcessingStrategy(72-74)ProcessingStrategy(76-78)ProcessingStrategy(85-87)ProcessingStrategy_STORE_ONLY(47-47)ProcessingStrategy_EXECUTE_REMEDIATION(46-46)HealthEvents(156-162)HealthEvents(175-175)HealthEvents(190-192)platform-connectors/pkg/connectors/kubernetes/k8s_connector.go (1)
NewK8sConnector(47-58)
health-monitors/csp-health-monitor/pkg/triggerengine/trigger_test.go (2)
health-monitors/csp-health-monitor/pkg/triggerengine/trigger.go (1)
NewEngine(70-86)data-models/pkg/protos/health_event.pb.go (1)
ProcessingStrategy_EXECUTE_REMEDIATION(46-46)
health-monitors/csp-health-monitor/pkg/triggerengine/trigger.go (2)
data-models/pkg/protos/health_event_grpc.pb.go (1)
PlatformConnectorClient(43-45)data-models/pkg/protos/health_event.pb.go (8)
ProcessingStrategy(43-43)ProcessingStrategy(72-74)ProcessingStrategy(76-78)ProcessingStrategy(85-87)RecommendedAction(89-89)RecommendedAction(139-141)RecommendedAction(143-145)RecommendedAction(152-154)
event-exporter/pkg/transformer/cloudevents.go (2)
health-monitors/gpu-health-monitor/gpu_health_monitor/protos/health_event_pb2.pyi (1)
ProcessingStrategy(14-17)data-models/pkg/protos/health_event.pb.go (4)
ProcessingStrategy(43-43)ProcessingStrategy(72-74)ProcessingStrategy(76-78)ProcessingStrategy(85-87)
health-monitors/csp-health-monitor/cmd/maintenance-notifier/main.go (2)
data-models/pkg/protos/health_event.pb.go (1)
ProcessingStrategy_value(56-59)health-monitors/csp-health-monitor/pkg/triggerengine/trigger.go (1)
NewEngine(70-86)
health-monitors/gpu-health-monitor/gpu_health_monitor/protos/health_event_pb2.pyi (1)
data-models/pkg/protos/health_event.pb.go (4)
ProcessingStrategy(43-43)ProcessingStrategy(72-74)ProcessingStrategy(76-78)ProcessingStrategy(85-87)
π Additional comments (31)
fault-quarantine/pkg/evaluator/rule_evaluator_test.go (1)
263-263: LGTM!The addition of
processingStrategywithfloat64(0)correctly reflects the default protobuf enum value (EXECUTE_REMEDIATION = 0) after JSON serialization, which represents numbers as float64.tests/helpers/healthevent.go (2)
48-48: LGTM!The new
ProcessingStrategyfield follows the same pattern asRecommendedAction(also an int with omitempty), maintaining consistency in the test helper struct.
153-156: LGTM!The fluent setter follows the established builder pattern used throughout this file.
distros/kubernetes/nvsentinel/charts/csp-health-monitor/values.yaml (1)
52-57: Well documented configuration option.The inline comments clearly explain the valid values and their behavioral implications, following the Helm chart documentation guidelines. The default
EXECUTE_REMEDIATIONmaintains backward compatibility.platform-connectors/pkg/connectors/kubernetes/process_node_events.go (3)
325-343: LGTM - Clean filtering implementation.The
filterProcessableEventsfunction correctly filters outSTORE_ONLYevents with appropriate logging for observability. The function is well-structured and follows Go conventions.
345-370: Good refactor - centralizes K8s event creation.The
createK8sEventhelper consolidates event construction logic that was previously inline, improving maintainability and reducing code duplication.
372-416: LGTM - Correctly updated to use filtered events.The
processHealthEventsfunction properly filters events before processing and uses the filtered slice (processableEvents) consistently for both node condition updates and K8s event creation.tests/helpers/kube.go (2)
2207-2249: LGTM - Well-implemented DaemonSet rollout helper.The function follows the same pattern as
WaitForDeploymentRollout, with appropriate status checks and logging. It correctly validates that all pods are scheduled, updated, and ready.
2251-2336: LGTM - Comprehensive argument handling.The
SetDeploymentArgsandsetArgsOnContainerfunctions properly handle multiple argument formats (--flag=value,--flag value, and boolean flags). The retry-on-conflict pattern is correctly applied.event-exporter/pkg/transformer/cloudevents.go (1)
66-66: LGTM!The addition of
processingStrategyfollows the same pattern asrecommendedAction(line 61), using the protobuf enum's.String()method for consistent serialization in the CloudEvent payload.distros/kubernetes/nvsentinel/charts/csp-health-monitor/templates/deployment.yaml (1)
161-161: LGTM!The new
--processing-strategyargument correctly references the Helm value defined invalues.yaml, maintaining consistency with other configuration options passed to the maintenance-notifier container.Consider whether invalid values should fail at Helm rendering time rather than runtime. You could add a validation check in the template:
{{- if not (or (eq .Values.processingStrategy "EXECUTE_REMEDIATION") (eq .Values.processingStrategy "STORE_ONLY")) }} {{- fail "processingStrategy must be either EXECUTE_REMEDIATION or STORE_ONLY" }} {{- end }}This would catch typos during
helm install/upgraderather than at container startup.fault-quarantine/pkg/initializer/init.go (1)
66-66: LGTM - Correct pipeline selection for fault-quarantine.Switching to
BuildProcessableHealthEventInsertsPipeline()ensures the fault-quarantine module only processes health events withEXECUTE_REMEDIATIONstrategy, filtering outSTORE_ONLYevents at the data layer. This aligns with the PR objective of allowing observability-only event handling.store-client/pkg/client/pipeline_builder_test.go (1)
69-86: LGTM!The test follows the established table-driven pattern and validates the new pipeline method across both MongoDB and PostgreSQL builders. The assertions are appropriate for verifying pipeline structure.
tests/event_exporter_test.go (1)
85-85: LGTM!The test correctly validates the new
processingStrategyfield with the expected value "EXECUTE_REMEDIATION".tests/helpers/event_exporter.go (2)
220-254: LGTM!The new helper function follows the established pattern of
FindEventByNodeAndMessageand provides a useful search capability for CloudEvents by node name, check name, and health status.
256-283: LGTM!The extended validation correctly asserts the new
processingStrategyfield in CloudEvents, maintaining consistency with the protobuf schema changes.store-client/pkg/client/postgresql_pipeline_builder.go (1)
119-132: LGTM!The pipeline correctly filters for INSERT operations with
processingStrategy=EXECUTE_REMEDIATION. The implementation aligns with the existing pipeline patterns in the file.event-exporter/pkg/transformer/cloudevents_test.go (1)
69-69: LGTM!The test correctly validates that the
ProcessingStrategyfield is properly serialized to CloudEvents. The use oft.Errorffor assertions aligns with the coding guidelines for simple equality checks.Also applies to: 106-108
store-client/pkg/client/pipeline_builder.go (1)
35-38: LGTM!The new interface method is well-documented with clear usage guidance. The godoc comment effectively explains the filtering behavior for processing-strategy aware pipelines.
data-models/protobufs/health_event.proto (2)
32-38: LGTM!The
ProcessingStrategyenum is well-documented with clear semantic definitions. UsingEXECUTE_REMEDIATION = 0as the default value ensures backward compatibilityβexisting events without this field will automatically have execution behavior, which is the expected default.
77-77: LGTM!The new field follows protobuf best practices with sequential field numbering and uses the appropriate enum type.
health-monitors/csp-health-monitor/pkg/triggerengine/trigger_test.go (1)
193-193: LGTM!All test cases correctly updated to pass
pb.ProcessingStrategy_EXECUTE_REMEDIATIONto theNewEngineconstructor, maintaining consistency with the new signature.Also applies to: 207-207, 612-612, 793-793, 839-839
health-monitors/csp-health-monitor/cmd/maintenance-notifier/main.go (2)
61-61: LGTM! CLI flag and configuration integration looks correct.The new
--processing-strategyflag with defaultEXECUTE_REMEDIATIONfollows the existing flag patterns in this file. The help text clearly documents the valid options.Also applies to: 75-76
219-227: Validation logic is correct and idiomatic.Using
pb.ProcessingStrategy_valuemap for validation ensures only valid protobuf enum values are accepted. The error message is clear about expected values. The cast topb.ProcessingStrategy(value)is safe sincevaluewas validated.platform-connectors/pkg/connectors/kubernetes/k8s_platform_connector_test.go (2)
1391-1506: Comprehensive test coverage for STORE_ONLY processing strategy.The test cases thoroughly cover the expected behavior:
- STORE_ONLY events should not create node conditions or K8s events
- EXECUTE_REMEDIATION events should create appropriate conditions
- Mixed batches correctly process only EXECUTE_REMEDIATION events
The table-driven approach with descriptive names follows Go testing best practices.
1550-1562: Good approach to isolating NVSentinel-specific conditions.The filtering logic correctly excludes all standard Kubernetes node condition types before counting. This ensures the test validates only conditions created by the health event processing.
health-monitors/gpu-health-monitor/gpu_health_monitor/protos/health_event_pb2.py (1)
1-51: Auto-generated protobuf file - no manual review needed.This file is generated by the protocol buffer compiler (as noted in line 2). The changes reflect the addition of the
ProcessingStrategyenum to the proto schema. Ensure this was regenerated from the updated.protosource rather than manually edited.health-monitors/csp-health-monitor/pkg/triggerengine/trigger.go (3)
58-67: Engine struct correctly extended with processingStrategy field.The new field follows the existing struct field patterns. This design means the processing strategy is configured at the engine level (via CLI flag) and applies uniformly to all emitted health events.
69-86: Constructor updated correctly with new parameter.The
processingStrategyparameter is properly added and stored. This is a breaking API change - callers must now provide the strategy. Per the AI summary, tests have been updated accordingly.
348-366: ProcessingStrategy correctly propagated to health events.The
ProcessingStrategyfield is populated from the engine's configured value, ensuring all health events from this engine instance carry the same strategy. This aligns with the CLI-based configuration approach.health-monitors/gpu-health-monitor/gpu_health_monitor/protos/health_event_pb2.pyi (1)
14-18: Auto-generated Python stub file correctly reflects ProcessingStrategy additions.The
.pyistub file properly exposes:
ProcessingStrategyenum class withEXECUTE_REMEDIATIONandSTORE_ONLYvalues- Module-level enum constants
processingStrategyfield inHealthEventwith correct typingThis is generated code that mirrors the proto schema changes.
Also applies to: 31-32, 78-78, 104-104, 120-120, 138-138
Signed-off-by: Tanisha goyal <[email protected]>
Signed-off-by: Tanisha goyal <[email protected]>
61f34e7 to
d314731
Compare
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Actionable comments posted: 1
Caution
Some comments are outside the diff and canβt be posted inline due to platform limitations.
β οΈ Outside diff range comments (1)
health-monitors/csp-health-monitor/pkg/triggerengine/trigger.go (1)
348-369: Critical bug:ProcessingStrategyis set twice, hardcoded value overrides configured value.The
ProcessingStrategyfield is assigned twice in this struct literal:
- Line 354:
ProcessingStrategy: e.processingStrategy(correct, uses configured value)- Line 368:
ProcessingStrategy: pb.ProcessingStrategy_EXECUTE_REMEDIATION(hardcoded)In Go struct literals, the last assignment wins, so the hardcoded
EXECUTE_REMEDIATIONwill always override the configurede.processingStrategy, making the entire feature non-functional. The TODO comment references this PR but the hardcoded line should be removed.π§ Proposed fix - remove duplicate assignment and stale TODO
healthEvent := &pb.HealthEvent{ Agent: "csp-health-monitor", // Consistent agent name ComponentClass: event.ResourceType, // e.g., "EC2", "gce_instance" CheckName: "CSPMaintenance", // Consistent check name IsFatal: isFatal, IsHealthy: isHealthy, ProcessingStrategy: e.processingStrategy, Message: message, RecommendedAction: pb.RecommendedAction(actionEnum), EntitiesImpacted: []*pb.Entity{ { EntityType: event.ResourceType, EntityValue: event.ResourceID, // CSP's ID (e.g., instance-id, full gcp resource name) }, }, Metadata: event.Metadata, // Pass along metadata NodeName: event.NodeName, // K8s node name GeneratedTimestamp: timestamppb.New(time.Now()), - // TODO: Remove hardcoded processing strategy and make it configurable via the config file. - // PR: https://github.com/NVIDIA/NVSentinel/pull/641 - ProcessingStrategy: pb.ProcessingStrategy_EXECUTE_REMEDIATION, }
π€ Fix all issues with AI agents
In @tests/helpers/kube.go:
- Around line 2362-2364: The function docstring for RemoveDeploymentArgs
incorrectly says it "removes environment variables"; update the comment to
accurately describe that RemoveDeploymentArgs removes container command-line
arguments (args) from a deployment, specifying that if containerName is empty it
removes args from all containers, otherwise only from the named container, and
keep the note about using retry.RetryOnConflict for automatic retry handling;
ensure the summary line and any parameter descriptions reflect "arguments/args"
rather than "environment variables."
π§Ή Nitpick comments (1)
health-monitors/csp-health-monitor/cmd/maintenance-notifier/main.go (1)
219-227: Consider rejectingUNSPECIFIEDas a valid processing strategy.The validation uses
pb.ProcessingStrategy_valuewhich includesUNSPECIFIED(value 0), but the help text and values.yaml documentation only mentionEXECUTE_REMEDIATIONandSTORE_ONLYas valid options. AllowingUNSPECIFIEDcould lead to undefined behavior downstream.π‘ Proposed fix to explicitly reject UNSPECIFIED
value, ok := pb.ProcessingStrategy_value[appCfg.processingStrategy] - if !ok { + if !ok || pb.ProcessingStrategy(value) == pb.ProcessingStrategy_UNSPECIFIED { return fmt.Errorf("invalid processingStrategy %q (expected EXECUTE_REMEDIATION or STORE_ONLY)", appCfg.processingStrategy) }
π Review details
Configuration used: Path: .coderabbit.yaml
Review profile: CHILL
Plan: Pro
β Files ignored due to path filters (1)
data-models/pkg/protos/health_event.pb.gois excluded by!**/*.pb.go
π Files selected for processing (9)
distros/kubernetes/nvsentinel/charts/csp-health-monitor/templates/deployment.yamldistros/kubernetes/nvsentinel/charts/csp-health-monitor/values.yamlhealth-monitors/csp-health-monitor/cmd/maintenance-notifier/main.gohealth-monitors/csp-health-monitor/pkg/triggerengine/trigger.gohealth-monitors/csp-health-monitor/pkg/triggerengine/trigger_test.gotests/csp_health_monitor_test.gotests/event_exporter_test.gotests/helpers/event_exporter.gotests/helpers/kube.go
π§ Files skipped from review as they are similar to previous changes (4)
- health-monitors/csp-health-monitor/pkg/triggerengine/trigger_test.go
- tests/helpers/event_exporter.go
- tests/csp_health_monitor_test.go
- tests/event_exporter_test.go
π§° Additional context used
π Path-based instructions (2)
**/*.go
π CodeRabbit inference engine (.github/copilot-instructions.md)
**/*.go: Follow standard Go conventions with gofmt and golint
Use structured logging vialog/slogin Go code
Wrap errors with context usingfmt.Errorf("context: %w", err)in Go code
Withinretry.RetryOnConflictblocks, return errors without wrapping to preserve retry behavior
Use meaningful variable names such assyncedoverokfor cache sync checks
Useclient-gofor Kubernetes API interactions in Go code
Prefer informers over direct API calls for watching Kubernetes resources
Implement proper shutdown handling with context cancellation in Go code
Package-level godoc required for all Go packages
Function comments required for all exported Go functions
Use inline comments for complex logic only in Go code
TODO comments should reference issues in Go code
Extract informer event handler setup into helper methods
Use separate informers for different Kubernetes resource types
Files:
health-monitors/csp-health-monitor/pkg/triggerengine/trigger.gotests/helpers/kube.gohealth-monitors/csp-health-monitor/cmd/maintenance-notifier/main.go
**/values.yaml
π CodeRabbit inference engine (.github/copilot-instructions.md)
**/values.yaml: Document all values in Helm chartvalues.yamlwith inline comments
Include examples for non-obvious configurations in Helm chart documentation
Note truthy value requirements in Helm chart documentation where applicable
Files:
distros/kubernetes/nvsentinel/charts/csp-health-monitor/values.yaml
π§ Learnings (5)
π Common learnings
Learnt from: tanishagoyal2
Repo: NVIDIA/NVSentinel PR: 609
File: tests/data/healthy-event.json:19-20
Timestamp: 2026-01-07T09:54:49.335Z
Learning: In NVSentinel, healthy events (isHealthy: true) can legitimately use processingStrategy: 1 (EXECUTE_REMEDIATION) when the Fault Quarantine Manager needs to act on them to clear previous fault states or update cluster resources to reflect the healthy status.
π Learning: 2026-01-12T05:13:24.947Z
Learnt from: tanishagoyal2
Repo: NVIDIA/NVSentinel PR: 609
File: platform-connectors/pkg/connectors/kubernetes/process_node_events.go:347-372
Timestamp: 2026-01-12T05:13:24.947Z
Learning: In the NVSentinel codebase, all HealthEvent creation paths across health monitors (syslog-health-monitor, kubernetes-object-monitor, csp-health-monitor) consistently set GeneratedTimestamp to timestamppb.New(time.Now()), ensuring it is never nil in production flows.
Applied to files:
health-monitors/csp-health-monitor/pkg/triggerengine/trigger.go
π Learning: 2025-11-12T14:08:15.229Z
Learnt from: tanishagoyal2
Repo: NVIDIA/NVSentinel PR: 333
File: health-monitors/csp-health-monitor/pkg/csp/aws/aws.go:624-632
Timestamp: 2025-11-12T14:08:15.229Z
Learning: In the AWS health monitor codebase (health-monitors/csp-health-monitor), the EventID field in model.MaintenanceEvent stores the AWS entity ARN. This is set during normalization in aws_normalizer.go where EventID is assigned from EventMetadata.EntityArn. Therefore, when processing active events, using activeEvent.EventID as the EntityArn value is correct and intentional.
Applied to files:
health-monitors/csp-health-monitor/pkg/triggerengine/trigger.go
π Learning: 2025-12-23T10:34:13.121Z
Learnt from: tanishagoyal2
Repo: NVIDIA/NVSentinel PR: 610
File: platform-connectors/pkg/connectors/kubernetes/process_node_events.go:346-370
Timestamp: 2025-12-23T10:34:13.121Z
Learning: In platform-connectors/pkg/connectors/kubernetes/process_node_events.go, the Event.Type field is intentionally set to healthEvent.CheckName rather than the standard Kubernetes "Normal" or "Warning" values. This is a deliberate design choice for NVSentinel to apply node events with the checkname as the type.
Applied to files:
health-monitors/csp-health-monitor/pkg/triggerengine/trigger.godistros/kubernetes/nvsentinel/charts/csp-health-monitor/values.yaml
π Learning: 2026-01-07T09:54:49.335Z
Learnt from: tanishagoyal2
Repo: NVIDIA/NVSentinel PR: 609
File: tests/data/healthy-event.json:19-20
Timestamp: 2026-01-07T09:54:49.335Z
Learning: In NVSentinel, healthy events (isHealthy: true) can legitimately use processingStrategy: 1 (EXECUTE_REMEDIATION) when the Fault Quarantine Manager needs to act on them to clear previous fault states or update cluster resources to reflect the healthy status.
Applied to files:
distros/kubernetes/nvsentinel/charts/csp-health-monitor/values.yaml
𧬠Code graph analysis (1)
health-monitors/csp-health-monitor/cmd/maintenance-notifier/main.go (2)
data-models/pkg/protos/health_event.pb.go (1)
ProcessingStrategy_value(60-64)health-monitors/csp-health-monitor/pkg/triggerengine/trigger.go (1)
NewEngine(70-86)
π Additional comments (7)
tests/helpers/kube.go (2)
2232-2273: LGTM!The
WaitForDaemonSetRolloutfunction correctly checks all the relevant DaemonSet status conditions (DesiredNumberScheduled, UpdatedNumberScheduled, NumberReady) and follows the established patterns in this file with proper logging andrequire.Eventuallyusage.
2275-2360: LGTM!The
SetDeploymentArgsfunction correctly follows the established patterns fromSetDeploymentEnvVars, with proper retry-on-conflict handling and container targeting. ThesetArgsOnContainerhelper handles multiple argument formats (--flag=value,--flag value, boolean flags) appropriately.distros/kubernetes/nvsentinel/charts/csp-health-monitor/templates/deployment.yaml (1)
161-161: LGTM!The new
--processing-strategyargument correctly references.Values.processingStrategyand follows the established pattern of other container arguments in this deployment template.distros/kubernetes/nvsentinel/charts/csp-health-monitor/values.yaml (1)
52-57: LGTM!The new
processingStrategyconfiguration is well documented with clear descriptions of both modes. The defaultEXECUTE_REMEDIATIONmaintains backward compatibility with existing deployments. As per coding guidelines, the inline comments adequately document the configuration options.health-monitors/csp-health-monitor/cmd/maintenance-notifier/main.go (1)
61-61: LGTM!The new
processingStrategyfield and CLI flag are correctly added to the application configuration with an appropriate default value and descriptive help text.Also applies to: 75-76
health-monitors/csp-health-monitor/pkg/triggerengine/trigger.go (2)
58-67: LGTM!The
Enginestruct correctly includes the newprocessingStrategyfield with appropriate typepb.ProcessingStrategy.
69-86: LGTM!The
NewEngineconstructor correctly accepts the newprocessingStrategyparameter and initializes it on theEnginestruct.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Actionable comments posted: 0
π§Ή Nitpick comments (1)
tests/helpers/kube.go (1)
2362-2390: Missing container-not-found check for consistency.Unlike
SetDeploymentArgs(line 2306-2308) andRemoveDeploymentEnvVars(line 1259-1260), this function doesn't return an error when the specifiedcontainerNameis not found. This inconsistency could mask configuration errors in tests.β»οΈ Suggested fix for consistency
func RemoveDeploymentArgs( ctx context.Context, c klient.Client, deploymentName, namespace, containerName string, args map[string]string, ) error { return retry.RetryOnConflict(retry.DefaultRetry, func() error { deployment := &appsv1.Deployment{} if err := c.Resources().Get(ctx, deploymentName, namespace, deployment); err != nil { return err } if len(deployment.Spec.Template.Spec.Containers) == 0 { return fmt.Errorf("deployment %s/%s has no containers", namespace, deploymentName) } + found := false + for i := range deployment.Spec.Template.Spec.Containers { container := &deployment.Spec.Template.Spec.Containers[i] if containerName != "" && container.Name != containerName { continue } + found = true + removeArgsFromContainer(container, args) } + if containerName != "" && !found { + return fmt.Errorf("container %q not found in deployment %s/%s", containerName, namespace, deploymentName) + } + return c.Resources().Update(ctx, deployment) }) }
π Review details
Configuration used: Path: .coderabbit.yaml
Review profile: CHILL
Plan: Pro
π Files selected for processing (2)
tests/csp_health_monitor_test.gotests/helpers/kube.go
π§ Files skipped from review as they are similar to previous changes (1)
- tests/csp_health_monitor_test.go
π§° Additional context used
π Path-based instructions (1)
**/*.go
π CodeRabbit inference engine (.github/copilot-instructions.md)
**/*.go: Follow standard Go conventions with gofmt and golint
Use structured logging vialog/slogin Go code
Wrap errors with context usingfmt.Errorf("context: %w", err)in Go code
Withinretry.RetryOnConflictblocks, return errors without wrapping to preserve retry behavior
Use meaningful variable names such assyncedoverokfor cache sync checks
Useclient-gofor Kubernetes API interactions in Go code
Prefer informers over direct API calls for watching Kubernetes resources
Implement proper shutdown handling with context cancellation in Go code
Package-level godoc required for all Go packages
Function comments required for all exported Go functions
Use inline comments for complex logic only in Go code
TODO comments should reference issues in Go code
Extract informer event handler setup into helper methods
Use separate informers for different Kubernetes resource types
Files:
tests/helpers/kube.go
π§ Learnings (2)
π Common learnings
Learnt from: tanishagoyal2
Repo: NVIDIA/NVSentinel PR: 609
File: tests/data/healthy-event.json:19-20
Timestamp: 2026-01-07T09:54:49.335Z
Learning: In NVSentinel, healthy events (isHealthy: true) can legitimately use processingStrategy: 1 (EXECUTE_REMEDIATION) when the Fault Quarantine Manager needs to act on them to clear previous fault states or update cluster resources to reflect the healthy status.
π Learning: 2025-11-24T22:20:48.152Z
Learnt from: CR
Repo: NVIDIA/NVSentinel PR: 0
File: .github/copilot-instructions.md:0-0
Timestamp: 2025-11-24T22:20:48.152Z
Learning: Applies to **/*_test.go : Use `envtest` for testing Kubernetes controllers instead of fake clients
Applied to files:
tests/helpers/kube.go
π Additional comments (4)
tests/helpers/kube.go (4)
2232-2273: LGTM!The DaemonSet rollout wait logic correctly checks all required conditions (DesiredNumberScheduled > 0, UpdatedNumberScheduled, NumberReady) and follows the established patterns in this file. Good use of
t.Helper()and informative logging.
2275-2312: LGTM!The function follows the established pattern of
SetDeploymentEnvVarsand correctly handles retry-on-conflict without error wrapping inside the retry block, as per coding guidelines. Good documentation in the godoc comment.
2314-2360: LGTM, logic handles multiple argument formats correctly.The function correctly handles the three common CLI argument styles (
--flag=value,--flag,--flag value). The slice mutation during iteration is safe since you break immediately after any modification.
2392-2417: LGTM!The removal logic correctly handles both
--flag=valueand--flag valuestyles. The slice bounds are safe since thej+1 < len(container.Args)check guards the two-element removal case.
8296232 to
ca69902
Compare
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Actionable comments posted: 3
π€ Fix all issues with AI agents
In @tests/csp_health_monitor_test.go:
- Around line 500-513: In the require.Eventually closure, fix the variable
shadowing and return logic: assign the found event to the predeclared
receivedEvent (replace using the local event only) so ValidateCloudEvent
receives the populated map (use the map[string]any from
FindEventByNodeAndCheckName), and after successful helpers.ValidateCloudEvent
return true instead of false so the Eventually loop can stop; update the closure
around FindEventByNodeAndCheckName, receivedEvent, and the ValidateCloudEvent
call accordingly.
- Around line 518-531: Teardown incorrectly sets the csp-health-monitor
maintenance-notifier container arg "--processing-strategy" to "STORE_ONLY" (same
as setup) instead of restoring the default; update the Teardown block that calls
helpers.SetDeploymentArgs for service "csp-health-monitor" / container
"maintenance-notifier" to either remove the "--processing-strategy" arg or set
it to the default value "EXECUTE_REMEDIATION" (use whichever helper exists,
e.g., a RemoveDeploymentArg or SetDeploymentArgs with "--processing-strategy":
"EXECUTE_REMEDIATION") so subsequent tests are not affected.
- Around line 431-435: SetDeploymentArgs return value is ignored; update the
test to capture its error and fail the test on error. Call
helpers.SetDeploymentArgs(ctx, client, "csp-health-monitor",
helpers.NVSentinelNamespace, "maintenance-notifier",
map[string]string{"--processing-strategy":"STORE_ONLY"}) into an err variable,
check if err != nil and call t.Fatalf or t.Fatalf("SetDeploymentArgs failed:
%v", err) before proceeding to helpers.WaitForDeploymentRollout so configuration
failures are reported immediately.
π§Ή Nitpick comments (1)
tests/helpers/kube.go (1)
2362-2390: Consider adding container existence validation for consistency.Unlike
SetDeploymentArgs,RemoveDeploymentArgsdoesn't validate that the specified container exists whencontainerNameis provided. While idempotent removal is reasonable, this inconsistency could mask configuration errors.β»οΈ Optional: Add container validation
+ found := false + for i := range deployment.Spec.Template.Spec.Containers { container := &deployment.Spec.Template.Spec.Containers[i] if containerName != "" && container.Name != containerName { continue } + found = true + removeArgsFromContainer(container, args) } + if containerName != "" && !found { + return fmt.Errorf("container %q not found in deployment %s/%s", containerName, namespace, deploymentName) + } + return c.Resources().Update(ctx, deployment)
π Review details
Configuration used: Path: .coderabbit.yaml
Review profile: CHILL
Plan: Pro
π Files selected for processing (3)
health-monitors/csp-health-monitor/pkg/triggerengine/trigger.gotests/csp_health_monitor_test.gotests/helpers/kube.go
π§° Additional context used
π Path-based instructions (2)
**/*.go
π CodeRabbit inference engine (.github/copilot-instructions.md)
**/*.go: Follow standard Go conventions with gofmt and golint
Use structured logging vialog/slogin Go code
Wrap errors with context usingfmt.Errorf("context: %w", err)in Go code
Withinretry.RetryOnConflictblocks, return errors without wrapping to preserve retry behavior
Use meaningful variable names such assyncedoverokfor cache sync checks
Useclient-gofor Kubernetes API interactions in Go code
Prefer informers over direct API calls for watching Kubernetes resources
Implement proper shutdown handling with context cancellation in Go code
Package-level godoc required for all Go packages
Function comments required for all exported Go functions
Use inline comments for complex logic only in Go code
TODO comments should reference issues in Go code
Extract informer event handler setup into helper methods
Use separate informers for different Kubernetes resource types
Files:
health-monitors/csp-health-monitor/pkg/triggerengine/trigger.gotests/helpers/kube.gotests/csp_health_monitor_test.go
**/*_test.go
π CodeRabbit inference engine (.github/copilot-instructions.md)
**/*_test.go: Useenvtestfor testing Kubernetes controllers instead of fake clients
Usetestify/assertandtestify/requirefor assertions in Go tests
Write table-driven tests when testing multiple scenarios in Go
Name Go tests descriptively using format:TestFunctionName_Scenario_ExpectedBehavior
Files:
tests/csp_health_monitor_test.go
π§ Learnings (7)
π Common learnings
Learnt from: tanishagoyal2
Repo: NVIDIA/NVSentinel PR: 609
File: tests/data/healthy-event.json:19-20
Timestamp: 2026-01-07T09:54:49.335Z
Learning: In NVSentinel, healthy events (isHealthy: true) can legitimately use processingStrategy: 1 (EXECUTE_REMEDIATION) when the Fault Quarantine Manager needs to act on them to clear previous fault states or update cluster resources to reflect the healthy status.
π Learning: 2026-01-12T05:13:24.947Z
Learnt from: tanishagoyal2
Repo: NVIDIA/NVSentinel PR: 609
File: platform-connectors/pkg/connectors/kubernetes/process_node_events.go:347-372
Timestamp: 2026-01-12T05:13:24.947Z
Learning: In the NVSentinel codebase, all HealthEvent creation paths across health monitors (syslog-health-monitor, kubernetes-object-monitor, csp-health-monitor) consistently set GeneratedTimestamp to timestamppb.New(time.Now()), ensuring it is never nil in production flows.
Applied to files:
health-monitors/csp-health-monitor/pkg/triggerengine/trigger.go
π Learning: 2025-11-12T14:08:15.229Z
Learnt from: tanishagoyal2
Repo: NVIDIA/NVSentinel PR: 333
File: health-monitors/csp-health-monitor/pkg/csp/aws/aws.go:624-632
Timestamp: 2025-11-12T14:08:15.229Z
Learning: In the AWS health monitor codebase (health-monitors/csp-health-monitor), the EventID field in model.MaintenanceEvent stores the AWS entity ARN. This is set during normalization in aws_normalizer.go where EventID is assigned from EventMetadata.EntityArn. Therefore, when processing active events, using activeEvent.EventID as the EntityArn value is correct and intentional.
Applied to files:
health-monitors/csp-health-monitor/pkg/triggerengine/trigger.go
π Learning: 2025-12-23T10:34:13.121Z
Learnt from: tanishagoyal2
Repo: NVIDIA/NVSentinel PR: 610
File: platform-connectors/pkg/connectors/kubernetes/process_node_events.go:346-370
Timestamp: 2025-12-23T10:34:13.121Z
Learning: In platform-connectors/pkg/connectors/kubernetes/process_node_events.go, the Event.Type field is intentionally set to healthEvent.CheckName rather than the standard Kubernetes "Normal" or "Warning" values. This is a deliberate design choice for NVSentinel to apply node events with the checkname as the type.
Applied to files:
health-monitors/csp-health-monitor/pkg/triggerengine/trigger.go
π Learning: 2026-01-07T09:54:49.335Z
Learnt from: tanishagoyal2
Repo: NVIDIA/NVSentinel PR: 609
File: tests/data/healthy-event.json:19-20
Timestamp: 2026-01-07T09:54:49.335Z
Learning: In NVSentinel, healthy events (isHealthy: true) can legitimately use processingStrategy: 1 (EXECUTE_REMEDIATION) when the Fault Quarantine Manager needs to act on them to clear previous fault states or update cluster resources to reflect the healthy status.
Applied to files:
health-monitors/csp-health-monitor/pkg/triggerengine/trigger.go
π Learning: 2025-11-24T22:20:48.152Z
Learnt from: CR
Repo: NVIDIA/NVSentinel PR: 0
File: .github/copilot-instructions.md:0-0
Timestamp: 2025-11-24T22:20:48.152Z
Learning: Applies to **/*_test.go : Use `envtest` for testing Kubernetes controllers instead of fake clients
Applied to files:
tests/helpers/kube.go
π Learning: 2025-12-22T16:16:24.320Z
Learnt from: pteranodan
Repo: NVIDIA/NVSentinel PR: 607
File: client-go/nvgrpc/config_test.go:61-80
Timestamp: 2025-12-22T16:16:24.320Z
Learning: In Go tests across the repository, avoid introducing the testify dependency for simple equality/inequality checks. Use the standard testing package assertions (t.Error, t.Errorf, t.Fatal, etc.) for straightforward checks. Reserve third-party assertion libraries for complex scenarios that require richer diagnostics or expressive matchers.
Applied to files:
tests/csp_health_monitor_test.go
𧬠Code graph analysis (2)
health-monitors/csp-health-monitor/pkg/triggerengine/trigger.go (3)
health-monitors/csp-health-monitor/pkg/datastore/datastore.go (1)
Store(33-52)data-models/pkg/protos/health_event_grpc.pb.go (1)
PlatformConnectorClient(43-45)data-models/pkg/protos/health_event.pb.go (4)
ProcessingStrategy(45-45)ProcessingStrategy(77-79)ProcessingStrategy(81-83)ProcessingStrategy(90-92)
tests/csp_health_monitor_test.go (8)
tests/helpers/csp_health_monitor.go (4)
CSPHealthMonitorTestContext(42-47)SetupCSPHealthMonitorTest(77-81)AddGCPInstanceIDAnnotation(256-277)WaitForCSPHealthMonitorPoll(404-410)health-monitors/csp-health-monitor/pkg/csp/aws/aws.go (1)
NewClient(115-198)health-monitors/csp-health-monitor/pkg/csp/gcp/gcp.go (1)
NewClient(133-217)tests/helpers/kube.go (7)
SetDeploymentArgs(2279-2312)NVSentinelNamespace(64-64)WaitForDeploymentRollout(984-1125)GetNodeByName(466-475)EnsureNodeConditionNotPresent(1797-1818)EventuallyWaitTimeout(61-61)WaitInterval(63-63)tests/helpers/csp_api_mock.go (2)
CSPGCP(45-45)CSPMaintenanceEvent(49-66)data-models/pkg/model/maintenance_event.go (2)
CSP(43-43)MaintenanceType(46-46)tests/helpers/fault_quarantine.go (2)
AssertQuarantineState(317-384)QuarantineAssertion(56-60)tests/helpers/event_exporter.go (3)
GetMockEvents(36-99)FindEventByNodeAndCheckName(221-254)ValidateCloudEvent(257-283)
π Additional comments (6)
health-monitors/csp-health-monitor/pkg/triggerengine/trigger.go (3)
58-67: LGTM!The
processingStrategyfield is properly added to the Engine struct, allowing the strategy to be threaded through event processing.
70-86: LGTM!The constructor properly accepts and initializes the
processingStrategyparameter, maintaining consistency with the existing pattern for other configuration parameters.
348-366: LGTM!The
ProcessingStrategyis correctly propagated from the Engine to the emittedHealthEvent. Based on learnings, this correctly allows both healthy and unhealthy events to use either processing strategy as needed.tests/helpers/kube.go (3)
2231-2273: LGTM!The
WaitForDaemonSetRollouthelper follows the established pattern fromWaitForDeploymentRollout, properly checkingDesiredNumberScheduled,UpdatedNumberScheduled, andNumberReadystatus fields.
2275-2360: LGTM!The
SetDeploymentArgshelper correctly handles multiple argument formats (--flag=value,--flag value, and boolean--flag). The use ofretry.RetryOnConflictfollows the coding guidelines for retry behavior within retry blocks.
2392-2417: LGTM!The
removeArgsFromContainerhelper correctly handles both--flag=valueand--flag valueargument styles. The slice operations are safe even at boundary conditions.
ca69902 to
e3d61ff
Compare
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Actionable comments posted: 3
π€ Fix all issues with AI agents
In @tests/csp_health_monitor_test.go:
- Around line 500-513: The Eventually block is incorrect: it declares
receivedEvent but never assigns it and calls helpers.ValidateCloudEvent with
that uninitialized variable, and it returns false even after successful
validation. Replace usage of receivedEvent with the event returned by
helpers.FindEventByNodeAndCheckName (use the local variable event), and change
the final return value to true after helpers.ValidateCloudEvent succeeds so the
require.Eventually stops on success; keep the initial not-found case returning
false as-is.
- Around line 518-531: The Teardown currently forces the processing strategy to
"STORE_ONLY" which repeats the setup change; instead restore the
original/default strategy: capture the current deployment args during Setup
(e.g., read and save the processing strategy for
"csp-health-monitor"/"maintenance-notifier" in Setup) and in the Teardown
closure call helpers.SetDeploymentArgs with that saved value rather than
hardcoding "STORE_ONLY"; alternatively, if you cannot read the original at
Setup, set the strategy back to the expected default "EXECUTE_REMEDIATION" in
the Teardown to avoid impacting subsequent tests.
In @tests/helpers/kube.go:
- Around line 2362-2390: RemoveDeploymentArgs currently skips when a specific
containerName is provided but not present, causing silent no-ops; update
RemoveDeploymentArgs to track whether a matching container was found (e.g., a
bool flag while iterating deployment.Spec.Template.Spec.Containers), and after
the loop if containerName != "" and no match was found return an error like
fmt.Errorf("container %s not found in deployment %s/%s", containerName,
namespace, deploymentName). Ensure this check mirrors the behavior in
SetDeploymentArgs and RemoveDeploymentEnvVars and keep calls to
removeArgsFromContainer and c.Resources().Update unchanged.
π§Ή Nitpick comments (1)
tests/csp_health_monitor_test.go (1)
424-425: Unused variableinjectedEventID.The variable is assigned on line 480 but never used afterward. If you don't need to track the event ID for status updates (unlike other tests), consider using
_for the first return value.β»οΈ Optional cleanup
- var injectedEventID string var testInstanceID stringAnd on line 480:
- injectedEventID, _, err = testCtx.CSPClient.InjectEvent(event) + _, _, err = testCtx.CSPClient.InjectEvent(event)
π Review details
Configuration used: Path: .coderabbit.yaml
Review profile: CHILL
Plan: Pro
π Files selected for processing (3)
health-monitors/csp-health-monitor/pkg/triggerengine/trigger.gotests/csp_health_monitor_test.gotests/helpers/kube.go
π§° Additional context used
π Path-based instructions (2)
**/*.go
π CodeRabbit inference engine (.github/copilot-instructions.md)
**/*.go: Follow standard Go conventions with gofmt and golint
Use structured logging vialog/slogin Go code
Wrap errors with context usingfmt.Errorf("context: %w", err)in Go code
Withinretry.RetryOnConflictblocks, return errors without wrapping to preserve retry behavior
Use meaningful variable names such assyncedoverokfor cache sync checks
Useclient-gofor Kubernetes API interactions in Go code
Prefer informers over direct API calls for watching Kubernetes resources
Implement proper shutdown handling with context cancellation in Go code
Package-level godoc required for all Go packages
Function comments required for all exported Go functions
Use inline comments for complex logic only in Go code
TODO comments should reference issues in Go code
Extract informer event handler setup into helper methods
Use separate informers for different Kubernetes resource types
Files:
tests/csp_health_monitor_test.gotests/helpers/kube.gohealth-monitors/csp-health-monitor/pkg/triggerengine/trigger.go
**/*_test.go
π CodeRabbit inference engine (.github/copilot-instructions.md)
**/*_test.go: Useenvtestfor testing Kubernetes controllers instead of fake clients
Usetestify/assertandtestify/requirefor assertions in Go tests
Write table-driven tests when testing multiple scenarios in Go
Name Go tests descriptively using format:TestFunctionName_Scenario_ExpectedBehavior
Files:
tests/csp_health_monitor_test.go
π§ Learnings (6)
π Common learnings
Learnt from: tanishagoyal2
Repo: NVIDIA/NVSentinel PR: 609
File: tests/data/healthy-event.json:19-20
Timestamp: 2026-01-07T09:54:49.335Z
Learning: In NVSentinel, healthy events (isHealthy: true) can legitimately use processingStrategy: 1 (EXECUTE_REMEDIATION) when the Fault Quarantine Manager needs to act on them to clear previous fault states or update cluster resources to reflect the healthy status.
π Learning: 2025-12-22T16:16:24.320Z
Learnt from: pteranodan
Repo: NVIDIA/NVSentinel PR: 607
File: client-go/nvgrpc/config_test.go:61-80
Timestamp: 2025-12-22T16:16:24.320Z
Learning: In Go tests across the repository, avoid introducing the testify dependency for simple equality/inequality checks. Use the standard testing package assertions (t.Error, t.Errorf, t.Fatal, etc.) for straightforward checks. Reserve third-party assertion libraries for complex scenarios that require richer diagnostics or expressive matchers.
Applied to files:
tests/csp_health_monitor_test.go
π Learning: 2026-01-12T05:13:24.947Z
Learnt from: tanishagoyal2
Repo: NVIDIA/NVSentinel PR: 609
File: platform-connectors/pkg/connectors/kubernetes/process_node_events.go:347-372
Timestamp: 2026-01-12T05:13:24.947Z
Learning: In the NVSentinel codebase, all HealthEvent creation paths across health monitors (syslog-health-monitor, kubernetes-object-monitor, csp-health-monitor) consistently set GeneratedTimestamp to timestamppb.New(time.Now()), ensuring it is never nil in production flows.
Applied to files:
health-monitors/csp-health-monitor/pkg/triggerengine/trigger.go
π Learning: 2025-11-12T14:08:15.229Z
Learnt from: tanishagoyal2
Repo: NVIDIA/NVSentinel PR: 333
File: health-monitors/csp-health-monitor/pkg/csp/aws/aws.go:624-632
Timestamp: 2025-11-12T14:08:15.229Z
Learning: In the AWS health monitor codebase (health-monitors/csp-health-monitor), the EventID field in model.MaintenanceEvent stores the AWS entity ARN. This is set during normalization in aws_normalizer.go where EventID is assigned from EventMetadata.EntityArn. Therefore, when processing active events, using activeEvent.EventID as the EntityArn value is correct and intentional.
Applied to files:
health-monitors/csp-health-monitor/pkg/triggerengine/trigger.go
π Learning: 2025-12-23T10:34:13.121Z
Learnt from: tanishagoyal2
Repo: NVIDIA/NVSentinel PR: 610
File: platform-connectors/pkg/connectors/kubernetes/process_node_events.go:346-370
Timestamp: 2025-12-23T10:34:13.121Z
Learning: In platform-connectors/pkg/connectors/kubernetes/process_node_events.go, the Event.Type field is intentionally set to healthEvent.CheckName rather than the standard Kubernetes "Normal" or "Warning" values. This is a deliberate design choice for NVSentinel to apply node events with the checkname as the type.
Applied to files:
health-monitors/csp-health-monitor/pkg/triggerengine/trigger.go
π Learning: 2026-01-07T09:54:49.335Z
Learnt from: tanishagoyal2
Repo: NVIDIA/NVSentinel PR: 609
File: tests/data/healthy-event.json:19-20
Timestamp: 2026-01-07T09:54:49.335Z
Learning: In NVSentinel, healthy events (isHealthy: true) can legitimately use processingStrategy: 1 (EXECUTE_REMEDIATION) when the Fault Quarantine Manager needs to act on them to clear previous fault states or update cluster resources to reflect the healthy status.
Applied to files:
health-monitors/csp-health-monitor/pkg/triggerengine/trigger.go
𧬠Code graph analysis (1)
health-monitors/csp-health-monitor/pkg/triggerengine/trigger.go (2)
data-models/pkg/protos/health_event.pb.go (8)
ProcessingStrategy(44-44)ProcessingStrategy(76-78)ProcessingStrategy(80-82)ProcessingStrategy(89-91)RecommendedAction(93-93)RecommendedAction(143-145)RecommendedAction(147-149)RecommendedAction(156-158)health-monitors/gpu-health-monitor/gpu_health_monitor/protos/health_event_pb2.pyi (2)
ProcessingStrategy(14-18)RecommendedAction(20-30)
π Additional comments (8)
health-monitors/csp-health-monitor/pkg/triggerengine/trigger.go (3)
58-67: LGTM!The
processingStrategyfield addition follows the existing struct pattern with proper typing using the protobuf enum type.
348-366: LGTM!The
ProcessingStrategyfield is correctly assigned from the engine's configuration. This approach ensures consistent processing strategy across all health events emitted by this engine, which aligns with the design where healthy events can legitimately useEXECUTE_REMEDIATIONwhen the Fault Quarantine Manager needs to act on them. Based on learnings, this is valid behavior for the NVSentinel system.
69-86: No action neededβvalidation occurs upstream.Upstream validation in
health-monitors/csp-health-monitor/cmd/maintenance-notifier/main.go(lines 219-223) validates theprocessingStrategystring flag against valid enum values before callingNewEngine. The flag default isEXECUTE_REMEDIATION, and only validated values (EXECUTE_REMEDIATIONorSTORE_ONLY) are passed to the constructor. The constructor correctly follows the pattern where the caller guarantees valid input.Likely an incorrect or invalid review comment.
tests/csp_health_monitor_test.go (1)
427-458: LGTM on setup logic.The setup correctly configures the
STORE_ONLYprocessing strategy via deployment args before running the test, and waits for the deployment rollout. The flow is appropriate sinceSetDeploymentArgstriggers a rollout that effectively restarts the pods.tests/helpers/kube.go (4)
2232-2273: LGTM!The
WaitForDaemonSetRolloutfunction correctly validates DaemonSet rollout completion by checking all three status conditions (DesiredNumberScheduled,UpdatedNumberScheduled,NumberReady). The implementation follows the established patterns in this file, uses appropriate test helpers (t.Helper()), and provides good logging for debugging test failures.
2275-2312: LGTM!The function follows the same well-established pattern as
SetDeploymentEnvVars(lines 1167-1200) with proper retry-on-conflict semantics. Error handling correctly returns unwrapped errors from API calls to preserve retry behavior while wrapping non-retryable errors with context. As per coding guidelines, errors withinretry.RetryOnConflictblocks are handled correctly.
2314-2360: LGTM!The
setArgsOnContainerhelper correctly handles the three common CLI argument formats (--flag=value,--flag value,--flagboolean). The slice manipulation at line 2343 is correct, and thebreakafter each modification prevents index corruption issues.Minor note: The heuristic at line 2338 (
!strings.HasPrefix(container.Args[j+1], "-")) to distinguish values from flags could theoretically misidentify negative numbers, but this is a reasonable trade-off for typical Kubernetes container arguments and matches common patterns in CLI parsing.
2392-2417: LGTM!The slice manipulation is correct and bounds-safe. The
breakafter each removal prevents index corruption from the shrinking slice. The function correctly handles all three argument formats and the bounds check at line 2406 (j+1 < len(container.Args)) properly guards against out-of-bounds access.Note: The function removes only the first occurrence of each flag, which is appropriate behavior for typical container argument configurations.
75309ae to
0fdaae2
Compare
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Actionable comments posted: 1
π€ Fix all issues with AI agents
In @tests/csp_health_monitor_test.go:
- Around line 504-517: Teardown is resetting the deployment to STORE_ONLY,
ignoring the SetDeploymentArgs error, and discarding the context return from
TeardownCSPHealthMonitorTest; change the Teardown body to call
helpers.SetDeploymentArgs with "--processing-strategy" set back to
"EXECUTE_REMEDIATION" and check its error with require.NoError(t, err) like in
setup, and capture and return the context value from
helpers.TeardownCSPHealthMonitorTest (i.e., ctx =
helpers.TeardownCSPHealthMonitorTest(ctx, t, c, testCtx); return ctx) so the
original processing strategy and context are properly restored; keep existing
calls to helpers.WaitForDeploymentRollout and require.NoError for c.NewClient().
π Review details
Configuration used: Path: .coderabbit.yaml
Review profile: CHILL
Plan: Pro
π Files selected for processing (3)
health-monitors/csp-health-monitor/pkg/triggerengine/trigger.gotests/csp_health_monitor_test.gotests/helpers/kube.go
π§° Additional context used
π Path-based instructions (2)
**/*.go
π CodeRabbit inference engine (.github/copilot-instructions.md)
**/*.go: Follow standard Go conventions with gofmt and golint
Use structured logging vialog/slogin Go code
Wrap errors with context usingfmt.Errorf("context: %w", err)in Go code
Withinretry.RetryOnConflictblocks, return errors without wrapping to preserve retry behavior
Use meaningful variable names such assyncedoverokfor cache sync checks
Useclient-gofor Kubernetes API interactions in Go code
Prefer informers over direct API calls for watching Kubernetes resources
Implement proper shutdown handling with context cancellation in Go code
Package-level godoc required for all Go packages
Function comments required for all exported Go functions
Use inline comments for complex logic only in Go code
TODO comments should reference issues in Go code
Extract informer event handler setup into helper methods
Use separate informers for different Kubernetes resource types
Files:
tests/csp_health_monitor_test.gohealth-monitors/csp-health-monitor/pkg/triggerengine/trigger.gotests/helpers/kube.go
**/*_test.go
π CodeRabbit inference engine (.github/copilot-instructions.md)
**/*_test.go: Useenvtestfor testing Kubernetes controllers instead of fake clients
Usetestify/assertandtestify/requirefor assertions in Go tests
Write table-driven tests when testing multiple scenarios in Go
Name Go tests descriptively using format:TestFunctionName_Scenario_ExpectedBehavior
Files:
tests/csp_health_monitor_test.go
π§ Learnings (6)
π Common learnings
Learnt from: tanishagoyal2
Repo: NVIDIA/NVSentinel PR: 609
File: tests/data/healthy-event.json:19-20
Timestamp: 2026-01-07T09:54:49.335Z
Learning: In NVSentinel, healthy events (isHealthy: true) can legitimately use processingStrategy: 1 (EXECUTE_REMEDIATION) when the Fault Quarantine Manager needs to act on them to clear previous fault states or update cluster resources to reflect the healthy status.
π Learning: 2025-12-22T16:16:24.320Z
Learnt from: pteranodan
Repo: NVIDIA/NVSentinel PR: 607
File: client-go/nvgrpc/config_test.go:61-80
Timestamp: 2025-12-22T16:16:24.320Z
Learning: In Go tests across the repository, avoid introducing the testify dependency for simple equality/inequality checks. Use the standard testing package assertions (t.Error, t.Errorf, t.Fatal, etc.) for straightforward checks. Reserve third-party assertion libraries for complex scenarios that require richer diagnostics or expressive matchers.
Applied to files:
tests/csp_health_monitor_test.go
π Learning: 2026-01-12T05:13:24.947Z
Learnt from: tanishagoyal2
Repo: NVIDIA/NVSentinel PR: 609
File: platform-connectors/pkg/connectors/kubernetes/process_node_events.go:347-372
Timestamp: 2026-01-12T05:13:24.947Z
Learning: In the NVSentinel codebase, all HealthEvent creation paths across health monitors (syslog-health-monitor, kubernetes-object-monitor, csp-health-monitor) consistently set GeneratedTimestamp to timestamppb.New(time.Now()), ensuring it is never nil in production flows.
Applied to files:
health-monitors/csp-health-monitor/pkg/triggerengine/trigger.go
π Learning: 2025-11-12T14:08:15.229Z
Learnt from: tanishagoyal2
Repo: NVIDIA/NVSentinel PR: 333
File: health-monitors/csp-health-monitor/pkg/csp/aws/aws.go:624-632
Timestamp: 2025-11-12T14:08:15.229Z
Learning: In the AWS health monitor codebase (health-monitors/csp-health-monitor), the EventID field in model.MaintenanceEvent stores the AWS entity ARN. This is set during normalization in aws_normalizer.go where EventID is assigned from EventMetadata.EntityArn. Therefore, when processing active events, using activeEvent.EventID as the EntityArn value is correct and intentional.
Applied to files:
health-monitors/csp-health-monitor/pkg/triggerengine/trigger.go
π Learning: 2025-12-23T10:34:13.121Z
Learnt from: tanishagoyal2
Repo: NVIDIA/NVSentinel PR: 610
File: platform-connectors/pkg/connectors/kubernetes/process_node_events.go:346-370
Timestamp: 2025-12-23T10:34:13.121Z
Learning: In platform-connectors/pkg/connectors/kubernetes/process_node_events.go, the Event.Type field is intentionally set to healthEvent.CheckName rather than the standard Kubernetes "Normal" or "Warning" values. This is a deliberate design choice for NVSentinel to apply node events with the checkname as the type.
Applied to files:
health-monitors/csp-health-monitor/pkg/triggerengine/trigger.go
π Learning: 2026-01-07T09:54:49.335Z
Learnt from: tanishagoyal2
Repo: NVIDIA/NVSentinel PR: 609
File: tests/data/healthy-event.json:19-20
Timestamp: 2026-01-07T09:54:49.335Z
Learning: In NVSentinel, healthy events (isHealthy: true) can legitimately use processingStrategy: 1 (EXECUTE_REMEDIATION) when the Fault Quarantine Manager needs to act on them to clear previous fault states or update cluster resources to reflect the healthy status.
Applied to files:
health-monitors/csp-health-monitor/pkg/triggerengine/trigger.go
𧬠Code graph analysis (1)
health-monitors/csp-health-monitor/pkg/triggerengine/trigger.go (4)
health-monitors/csp-health-monitor/pkg/datastore/datastore.go (1)
Store(33-52)data-models/pkg/protos/health_event_grpc.pb.go (1)
PlatformConnectorClient(43-45)data-models/pkg/protos/health_event.pb.go (4)
ProcessingStrategy(44-44)ProcessingStrategy(76-78)ProcessingStrategy(80-82)ProcessingStrategy(89-91)health-monitors/gpu-health-monitor/gpu_health_monitor/protos/health_event_pb2.pyi (1)
ProcessingStrategy(14-18)
β° Context from checks skipped due to timeout of 90000ms. You can increase the timeout in your CodeRabbit configuration to a maximum of 15 minutes (900000ms). (20)
- GitHub Check: ko-build-test (event-exporter, .)
- GitHub Check: ko-build-test (health-monitors/kubernetes-object-monitor, .)
- GitHub Check: ko-build-test (node-drainer, .)
- GitHub Check: ko-build-test (health-monitors/csp-health-monitor, ./cmd/maintenance-notifier)
- GitHub Check: ko-build-test (fault-remediation, .)
- GitHub Check: ko-build-test (labeler, .)
- GitHub Check: ko-build-test (janitor, .)
- GitHub Check: ko-build-test (fault-quarantine, .)
- GitHub Check: ko-build-test (platform-connectors, .)
- GitHub Check: ko-build-test (health-events-analyzer, .)
- GitHub Check: container-build-test (log-collector, make -C log-collector docker-build-log-collector)
- GitHub Check: container-build-test (syslog-health-monitor, make -C health-monitors/syslog-health-monitor docker...
- GitHub Check: container-build-test (file-server-cleanup, make -C log-collector docker-build-file-server-cleanup)
- GitHub Check: container-build-test (gpu-health-monitor-dcgm3, make -C health-monitors/gpu-health-monitor docker...
- GitHub Check: container-build-test (gpu-health-monitor-dcgm4, make -C health-monitors/gpu-health-monitor docker...
- GitHub Check: E2E Tests (AMD64 + PostgreSQL)
- GitHub Check: E2E Tests (ARM64 + PostgreSQL)
- GitHub Check: E2E Tests (ARM64 + MongoDB)
- GitHub Check: E2E Tests (AMD64 + MongoDB)
- GitHub Check: CodeQL PR Analysis
π Additional comments (8)
health-monitors/csp-health-monitor/pkg/triggerengine/trigger.go (2)
58-86: LGTM!The
processingStrategyfield is correctly added to theEnginestruct, properly initialized via the constructor, and appropriately used in the event mapping function. This cleanly threads the configurable strategy through the trigger engine lifecycle.
348-366: LGTM!The
ProcessingStrategyis correctly set on theHealthEventfrom the engine's configured strategy, replacing the previous hardcoded default. This aligns with the PR's objective to make the processing strategy configurable.tests/helpers/kube.go (4)
2232-2273: LGTM!The
WaitForDaemonSetRollouthelper correctly checks all relevant DaemonSet status fields (DesiredNumberScheduled,UpdatedNumberScheduled,NumberReady) and follows the established polling pattern used by other helpers in this file.
2275-2312: LGTM!The
SetDeploymentArgsfunction correctly usesretry.RetryOnConflictfor automatic retry handling as per coding guidelines, validates container existence, and properly delegates to the internal helper.
2314-2360: LGTM!The
setArgsOnContainerhelper comprehensively handles the three common argument styles (--flag=value,--flag value, and boolean--flag). The logic correctly updates existing args or appends new ones as needed.
2362-2425: LGTM!The
RemoveDeploymentArgsandremoveArgsFromContainerfunctions follow the same patterns established by theSet*counterparts and correctly handle removal of all argument styles, including properly removing both the flag and its value when using--flag valuestyle.tests/csp_health_monitor_test.go (2)
417-459: LGTM on test setup.The setup correctly configures the
STORE_ONLYprocessing strategy and properly initializes the test environment with GCP annotations.
461-502: LGTM on test assertion.The assess step correctly verifies that with
STORE_ONLYstrategy, the node is not cordoned and the CSPMaintenance condition is not applied.
Signed-off-by: Tanisha goyal <[email protected]>
0fdaae2 to
400aab2
Compare
Summary
Type of Change
Component(s) Affected
Testing
Checklist
Summary by CodeRabbit
New Features
Tests
βοΈ Tip: You can customize this high-level summary in your review settings.